
Conversation

@jan-service-account

Updates dev branch with latest release (b6351) from ggml-org/llama.cpp

ggerganov and others added 14 commits on August 31, 2025 at 20:41

* sampling : optimize sorting using bucket sort in more places

ggml-ci

* sampling : do not sort in dist sampler

ggml-ci

* sampling : avoid heap allocations for sort buffers

ggml-ci

* common : add option to sort sampling candidates by probability

ggml-ci

* sampling : revert the change for preserving sort buffers

* sampling : use std::copy instead of memcpy

* sampling : clarify purpose of partial sort helpers

ggml-ci

* cont : remove wrong comment [no ci]

* common : update comment

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
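
As a rough illustration of the bucket-sort idea in the sampling commits above: candidates carry a probability in [0, 1], so they can be distributed into fixed probability buckets and only the small per-bucket ranges need a comparison sort. This is a hypothetical sketch in plain C++, not the llama.cpp sampling code; the `candidate` struct and `bucket_sort_desc` helper are invented for illustration.

```cpp
// Minimal sketch of bucket-sorting sampling candidates by probability.
// NOT llama.cpp's implementation; struct name and bucket count are arbitrary.
#include <algorithm>
#include <cstdio>
#include <vector>

struct candidate {
    int   id;   // token id
    float p;    // probability in [0, 1]
};

// Sort candidates in descending probability: distribute into fixed buckets,
// sort each (small) bucket, then concatenate from the highest bucket down.
static void bucket_sort_desc(std::vector<candidate> & cands, int n_buckets = 128) {
    std::vector<std::vector<candidate>> buckets(n_buckets);
    for (const candidate & c : cands) {
        int b = std::min(n_buckets - 1, (int) (c.p * n_buckets));
        buckets[b].push_back(c);
    }
    size_t out = 0;
    for (int b = n_buckets - 1; b >= 0; --b) {
        std::sort(buckets[b].begin(), buckets[b].end(),
                  [](const candidate & lhs, const candidate & rhs) { return lhs.p > rhs.p; });
        for (const candidate & c : buckets[b]) {
            cands[out++] = c;
        }
    }
}

int main() {
    std::vector<candidate> cands = { {0, 0.05f}, {1, 0.60f}, {2, 0.20f}, {3, 0.15f} };
    bucket_sort_desc(cands);
    for (const candidate & c : cands) {
        std::printf("token %d  p=%.2f\n", c.id, c.p);
    }
    return 0;
}
```

The "do not sort in dist sampler" change fits the same picture: sampling from the full distribution presumably does not need sorted candidates at all, which is why sorting by probability becomes an opt-in option instead.
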
* CANN: fix RoPE cache issue on multi-device

The RoPE cache only needs to be computed once per token. In multi-device scenarios, however, not every device starts computation from layer 0, which may lead to unallocated-memory issues and precision errors.

This commit records the first layer of each device to avoid these issues.

* CANN: Optimize first-layer detection method

* CANN: Remove trailing whitespace

* CANN: Only cache data that can be determined to be unchanged from the parameters.

* CANN: Update function comment
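
A hypothetical sketch of the "record the first layer per device" idea from the CANN commit above; the real ggml-cann cache structures and entry points differ, and all names below are invented for illustration.

```cpp
// Hypothetical sketch: decide per device when the RoPE cache must be rebuilt,
// keyed on that device's own first layer rather than on layer 0.
#include <cstdint>
#include <unordered_map>

struct rope_cache_state {
    int64_t cached_token = -1;  // token position the cache was built for
    int     first_layer  = -1;  // first layer this device computes
};

// One cache state per device (illustrative global; real code keeps this in the backend context).
static std::unordered_map<int, rope_cache_state> g_state;

// Returns true if the RoPE sin/cos cache must be (re)computed on this device
// for this (token, layer) pair.
static bool need_rope_cache_update(int device, int64_t token, int layer) {
    rope_cache_state & st = g_state[device];
    if (st.first_layer < 0) {
        st.first_layer = layer;     // remember where this device starts
    }
    // Recompute only when this device sees its own first layer of a new token,
    // instead of assuming layer 0 (which a non-primary device may never run).
    if (layer == st.first_layer && token != st.cached_token) {
        st.cached_token = token;
        return true;
    }
    return false;
}
```
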
* CUDA: fix build error from ambiguous __half conversions in conv2d (ggml-org#15690)

Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.

Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.

Fixes some build errors with half-precision kernels on CUDA.

ggml-ci

* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion

* CUDA: Add missing convert.cuh header

* CUDA: remove unnecessary extension in ggml_cuda_cast

* CUDA: Address review comment, remove second type template argument
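
The ambiguity and the fix can be reproduced in plain C++ with a stand-in type. In the sketch below, `Half` imitates `__half`'s multiple implicit conversion operators; the real code converts through `__half2float` (and, after review, the unified `ggml_cuda_cast` helper), not through this toy `to_float`.

```cpp
// Plain C++ illustration of the ambiguity described in the commits above.
// `Half` is a stand-in for CUDA's __half.
#include <cstdio>

struct Half {
    float v;
    // __half similarly defines several implicit conversion operators:
    operator float() const { return v; }
    operator int()   const { return (int) v; }
    operator short() const { return (short) v; }
};

// Generic pass-through for float (and other arithmetic types).
template <typename T>
static float to_float(const T & x) { return static_cast<float>(x); }

// Explicit conversion for the half type; in CUDA this would call __half2float(x).
static float to_float(const Half & x) { return x.v; }

int main() {
    Half  h{0.5f};
    float w = 2.0f;

    // float acc = h * w;          // error: ambiguous (Half -> float/int/short are all viable)
    float acc = to_float(h) * w;   // OK: conversion made explicit before the multiply
    std::printf("%f\n", acc);
    return 0;
}
```
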

* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops

This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
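
For context, TRANSPOSE and RESHAPE in ggml are view operations, so backend support largely means handling the resulting re-strided views. A minimal sketch using ggml's public C API (independent of the WebGPU backend; the buffer size and shapes below are arbitrary):

```cpp
// Build a small graph context and create transpose/reshape views of a tensor.
#include "ggml.h"

int main() {
    ggml_init_params params = { /*mem_size  =*/ 16 * 1024 * 1024,
                                /*mem_buffer=*/ nullptr,
                                /*no_alloc  =*/ false };
    ggml_context * ctx = ggml_init(params);

    // 4 x 6 matrix of f32
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 6);

    // TRANSPOSE: a view with the first two dimensions swapped, no data copied
    ggml_tensor * at = ggml_transpose(ctx, a);

    // RESHAPE: a view of the same data interpreted as 2 x 12
    ggml_tensor * ar = ggml_reshape_2d(ctx, a, 2, 12);

    (void) at; (void) ar;
    ggml_free(ctx);
    return 0;
}
```
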

* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (ggml-org#14903)

* vulkan: use subgroup operations for quantize_q8_1 shader

* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader

* vulkan: use q8_1_x4 blocks in mul_mmq shader

* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec

* vulkan: tune mul_mat_vecq performance for Intel

* vulkan: fix quantizing issue when tensor is not divisible by 128

* vulkan: adapt integer dot mmv to mmv small m optimization (ggml-org#15355)

* vulkan: allow all subgroup modes for mmv and mmvq

* vulkan: use prealloc intermediate reuse for mmvq path

* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090

* vulkan: adapt mmv quantize_y path to conditional sync logic

* vulkan: disable q8_0 mmvq on Nvidia

* vulkan: enable q8_0 on Nvidia pre-turing

* fix prealloc sync condition

* fix llvmpipe subgroup 8 issue
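
One way to picture the q8_1_x4 change: a single q8_1 block is 36 bytes (two fp16 values plus 32 int8 quants), so an array of such blocks cannot give 128-bit aligned loads. Grouping four blocks and separating the metadata from the quants restores 16-byte alignment. The structs below only illustrate that layout idea; they are not the actual shader or host definitions from this PR.

```cpp
// Illustration of why a 4-block grouping helps 128-bit alignment.
#include <cstdint>

struct block_q8_1 {                 // 36 bytes: consecutive blocks are not 16-byte aligned
    uint16_t d;                     // fp16 scale (raw bits here)
    uint16_t s;                     // fp16 d * sum(qs), used as a bias term in dot products
    int8_t   qs[32];                // quantized values
};

struct alignas(16) block_q8_1_x4 {  // hypothetical x4 grouping with 128-bit alignment
    uint16_t d[4];                  // scales of the four blocks
    uint16_t s[4];                  // sums of the four blocks
    int8_t   qs[4][32];             // quants; each 32-byte row starts on a 16-byte boundary
};

static_assert(sizeof(block_q8_1)    == 36,  "q8_1 block is 36 bytes");
static_assert(sizeof(block_q8_1_x4) == 144, "x4 grouping keeps 4x36 bytes, now 16-byte aligned");
```
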

* Added SVE implementation for vec_dot_fp16 kernel (ggml-org#15115)

* removed white spaces

* Added comment

* removed white spaces

* changed GGML_F16x_VEC_FMA for code consistency

* Update vec.h

---------

Co-authored-by: vithulep <[email protected]>
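
For reference, a bare-bones SVE fp16 dot product looks roughly like the sketch below. The actual ggml kernel is structured differently (it goes through the GGML_F16x_VEC_FMA abstractions, unrolls, and handles accumulation more carefully), so treat this only as an outline of the intrinsics involved; it needs an SVE-capable AArch64 toolchain.

```cpp
// Rough sketch of a predicated SVE fp16 dot product (compile with -march=armv8-a+sve).
#include <arm_sve.h>

static float vec_dot_f16_sve(const float16_t * x, const float16_t * y, int64_t n) {
    svfloat16_t acc = svdup_n_f16(0);
    for (int64_t i = 0; i < n; i += svcnth()) {
        svbool_t    pg = svwhilelt_b16_s64(i, n);       // active lanes, handles the tail
        svfloat16_t vx = svld1_f16(pg, x + i);
        svfloat16_t vy = svld1_f16(pg, y + i);
        acc = svmla_f16_m(pg, acc, vx, vy);             // acc += x * y on active lanes
    }
    return (float) svaddv_f16(svptrue_b16(), acc);      // horizontal sum
}
```
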
* SVE support for exponential functions

Add const notation to variable pg

* Update ggml/src/ggml-cpu/vec.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Add const

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
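
The memory-budget commits boil down to chaining VkPhysicalDeviceMemoryBudgetPropertiesEXT into the memory-properties query when VK_EXT_memory_budget is supported, and reporting heapBudget instead of the raw heap size. A condensed sketch (the backend additionally detects and caches the extension availability at init, which is omitted here):

```cpp
// Query per-heap budget and usage via VK_EXT_memory_budget.
#include <vulkan/vulkan.h>
#include <cstdio>

static void print_heap_budgets(VkPhysicalDevice dev) {
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {};
    budget.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;

    VkPhysicalDeviceMemoryProperties2 props = {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2;
    props.pNext = &budget;   // only meaningful if VK_EXT_memory_budget is supported

    vkGetPhysicalDeviceMemoryProperties2(dev, &props);

    for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; ++i) {
        // heapBudget is the driver's estimate of what this process may use,
        // which can be well below the raw heap size.
        std::printf("heap %u: budget %llu, usage %llu, size %llu\n", i,
                    (unsigned long long) budget.heapBudget[i],
                    (unsigned long long) budget.heapUsage[i],
                    (unsigned long long) props.memoryProperties.memoryHeaps[i].size);
    }
}
```
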
@jan-service-account merged commit 2a87e1c into dev on Sep 2, 2025

17 checks passed

@jan-service-account deleted the update-dev-from-master-2025-09-02-00-34 branch on September 2, 2025 at 00:47